
Stanford Machine Learning Notes

Introduction

What is machine learning

  • Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.
  • Tom Mitchell (1998): A well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine learning algorithms:

We mainly talk about two types of algorithms:

  • Supervised Learning
  • Unsupervised Learning
  • Others: Reinforcement Learning, Recommender Systems.

We also cover practical advice for applying learning algorithms.

Supervised Learning

"Right Answers" are given.

  • Regression (Housing price prediction)

    Predict continuous-valued output (price). screenshot-01.png

  • Classification (Breast cancer (malignant, benign))

    Discrete-valued output (0 or 1). screenshot-02.png

Unsupervised Learning

No labels ("right answers") are given.

  • Google News Grouping

    screenshot-03.png

  • Cocktail party problem

    screenshot-04.png

Linear Regression (Week 1)

Model Representation

Notations:

  1. \(m\) is Number of training examples.
  2. \(x\)'s is "input" variable / feature.
  3. \(y\)'s is "output" variable / "target" variables.
  4. \((x,y)\) is one training example.
  5. \((x^{(i)},y^{(i)})\) is the \(i\)th example, \(i\) starts from \(1\).
  6. \(h\) is called the hypothesis; it maps inputs to outputs. In this lecture we represent \(h\) with a linear function, hence the name linear regression. Linear regression with one variable is called univariate linear regression, and is usually written as \(h_\theta (x) = \theta_0 + \theta_1 x\), where the \(\theta_i\) are the parameters (coefficients). Sometimes it's written as \(h(x)\) as a shorthand.

Cost Function

Since the hypothesis is written as \(h_\theta (x) = \theta_0 + \theta_1 x\), where the \(\theta_i\)'s are the parameters, how do we choose the \(\theta_i\)'s? The idea is to choose \(\theta_0, \theta_1\) so that \(h_\theta (x)\) is close to \(y\) for our training examples \((x,y)\). More formally, we want to \[ \underset{\theta_0,\theta_1}{\text{minimize}} \frac{1}{2m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \] where \(m\) is the number of training examples. To recap, we're minimizing half of the average squared error. Note that the variables here are the \(\theta\)'s; the \(x\)'s and \(y\)'s are constants.

By convention, we define the cost function \(J(\theta_0,\theta_1)\) to represent the objective function. That is

\begin{gather*} J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\ \underset{\theta_0,\theta_1}{\text{minimize}} J(\theta_0,\theta_1) \end{gather*}

This cost function is also called the squared error function. There are other cost functions, but it turns out that the squared error function is a reasonable choice and will work for most regression problems.

Cost Function Intuition

Before getting an intuition about the cost function, let's recap. We now have

  1. Hypothesis: \[h_\theta (x) = \theta_0 + \theta_1 x\]
  2. Parameters: \[\theta_0,\theta_1\]
  3. Cost Function: \[J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
  4. Goal: \[\underset{\theta_0,\theta_1}{\text{minimize}} J(\theta_0,\theta_1)\]

In order to visualize our cost function, we use a simplified hypothesis function: \(h_\theta (x) = \theta_1 x\), which sets \(\theta_0\) to \(0\). So now we have

  1. Hypothesis: \[h_\theta (x) = \theta_1 x\]
  2. Parameters: \[\theta_1\]
  3. Cost Function: \[J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
  4. Goal: \[\underset{\theta_1}{\text{minimize}} J(\theta_1)\]

So now let's compare function \(h_\theta (x)\) and function \(J(\theta_1)\): screenshot-05.png

Then let's come back to the original function, where we don't have the constraint that \(\theta_0 = 0\). The comparison looks like screenshot-06.png

Gradient Descent

Now we have some function \(J(\theta_0,\theta_1)\) and we want to \(\underset{\theta_0,\theta_1}{\text{minimize}}J(\theta_0,\theta_1)\). We use gradient descent, which works as follows:

  1. Start with some \(\theta_0,\theta_1\),
  2. Keep changing \(\theta_0,\theta_1\) to reduce \(J(\theta_0,\theta_1)\), until we hopefully end up at a minimum.

To help understand gradient descent, suppose you are standing at a point on a hill and want to take a small step downhill as quickly as possible; you would choose the steepest downhill direction. screenshot-07.png You keep doing this until you get to a local minimum. screenshot-08.png

But if you start with a different initial position, gradient descent may take you to a (very) different local minimum. screenshot-08.png

Gradient Descent algorithm

We use \(a := b\) to represent assignment and \( a = b\) to represent truth assertion.
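For reference, each gradient descent step updates (repeat until convergence):

\[ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1) \]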

The \(\alpha\) here is called the learning rate.

Pay attention that when implementing gradient descent, we need to update all \(\theta\text{s}\) simultaneously, as sketched below. screenshot-10.png
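A minimal Octave sketch of a correct simultaneous update for the univariate case (dJ_dtheta0 and dJ_dtheta1 are placeholder names for the computed partial derivative terms):

temp0 = theta0 - alpha * dJ_dtheta0;  % compute both temporaries first...
temp1 = theta1 - alpha * dJ_dtheta1;
theta0 = temp0;                       % ...then assign, so the update of theta1
theta1 = temp1;                       % never sees an already-updated theta0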

Recall that \(\alpha\) is called the learning rate; it scales the step given by the derivative term. Taking the 1D case as an example, the derivative (the slope of the tangent line) points in the direction where the function value increases, so we step in its negative direction. screenshot-11.png

If \(\alpha\) is too small, gradient descent can be slow. If \(\alpha\) is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. screenshot-12.png

Gradient descent can converge to a local minimum even with the learning rate \(\alpha\) fixed. This is because as we approach a local minimum, gradient descent automatically takes smaller steps (the derivative shrinks). So there is no need to decrease \(\alpha\) over time.

Batch gradient descent: each step of gradient descent uses all the training examples.

Linear Algebra (Week 1, Optional)

This lecture uses 1-indexed subscripts.

Linear Regression with Multiple Variables (Week 2)

Multiple Features

screenshot-13.png Notation:

  1. number of features: \(n\)
  2. input (features) of \(i^{\text{th}}\) training example: \(x^{(i)}\)
  3. value of feature \(j\) in the \(i^{\text{th}}\) training example: \(x^{(i)}_{j}\)

Now our hypothesis is \(h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \). For convenience of notation, define \(x_0 = 1\), or \(x^{(i)}_0 = 1\). So now we define our hypothesis as

\begin{equation*} \begin{split} h_\theta (x) &= \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \\ &= \theta^T x \end{split} \end{equation*}

where \(\theta\) and \(x\) are \((n+1)\)-dimensional vectors.

Gradient Descent for Multiple Variables

So now our new algorithm becomes screenshot-14.png Note that the new update for \(\theta_0\) is just like the old one, since \(x_0^{(i)} = 1\).
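Written out, the update for every \(j = 0,\dots,n\) (performed simultaneously) is:

\[ \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]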

Gradient Descent in Practice - Feature Scaling.

Idea: Make sure features are on a similar scale, then gradient descent will converge more quickly.

Take a 2D example: if the scale of \(x_1\) is much larger than the scale of \(x_2\), the contours form a long, thin ellipse, which makes it much more difficult for gradient descent to find the minimum. screenshot-15.png We can rescale all features to [0,1] so that the contours become circles. screenshot-16.png

Usually we get every feature into approximately a \(-1 \leq x \leq 1\) range, but it need not be exact; say, \(-3 \leq x \leq 3\) is OK.

Mean normalization: Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean. Note that this does not apply to \(x_0 = 1\).
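Combining feature scaling and mean normalization, a common rule is

\[ x_i := \frac{x_i - \mu_i}{s_i} \]

where \(\mu_i\) is the mean of feature \(i\) and \(s_i\) is its range (max minus min) or its standard deviation.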

Gradient Descent in Practice - Learning Rate.

To make sure gradient descent is working correctly, plot the value of \(J(\theta)\) versus the number of iterations. screenshot-17.png

For sufficiently small \(\alpha\), \(J(\theta)\) should decrease on every iteration. But if \(\alpha\) is too small, gradient descent can converge too slowly. To choose \(\alpha\), try \( \dots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \dots\), i.e., increase by roughly 3x each time.

Features and Polynomial Regression

You should engineer good features instead of just using the ones you're given. For example, for housing price prediction, given frontage and depth features, you can define a new feature area = frontage × depth. screenshot-18.png

Sometimes a polynomial hypothesis fits better. If a straight line is not enough, you could use size, size^2, size^3 as features. screenshot-18.png

Or you can use the square root as a feature. screenshot-19.png

Normal Equation

How do we find the minimum of \(J(\theta)\) analytically?

Using calculus, we can take the partial derivative with respect to each variable and set it to 0.

Normal equation: screenshot-22.png Then we can compute \(\theta = (X^T X)^{-1} X^T y\). The Octave code is:

pinv(X' * X) * X' * y
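As a usage sketch (the house sizes and prices below are the illustrative numbers from the lecture):

X = [ones(4,1), [2104; 1416; 1534; 852]]   % design matrix, first column is x0 = 1
y = [460; 232; 315; 178];                  % prices
theta = pinv(X' * X) * X' * y;             % normal equation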

Compare Gradient Descent with Normal Equation. screenshot-20.png

Normal Equation and Noninvertibility (Optional)

What if \(X^T X\) is non-invertible? (Using pinv in Octave will still work.) Common causes:

  1. Redundant features (linearly dependent). E.g., \(x_1\) = size in feet², \(x_2\) = size in m².
  2. Too many features (e.g., \(m \leq n\)). Delete some features, or use regularization.

Octave Tutorial

Basic Operation

5+6
3-2
5*8
1/2
2^6


1 == 2 %false
1 ~= 2 
1 && 0
1 || 0

xor(1,0) %XOR

PS1('>> '); % to change the prompt sign

a = 3; % semicolon suppresses output
a

disp(a);
disp(sprintf('2 decimals: %0.2f', a))

a
format long
a
format short
a

A = [1 2; 3 4; 5 6]

A = [1 2
     3 4
     5 6]

v = [1 2 3]

v = 1:0.1:2

ones(2,3)

C = 2*ones(2,3)

C = [2 2 2; 2 2 2]

w = ones(1,3)
w = zeros(1,3)
w = rand(1,3)

rand(3,3)
rand(3,3)

w = randn(1,3) % Gaussian distribution

w = -6 + sqrt(10)*(randn(1,10000)); % you don't want to omit the semicolon here

hist(w)
hist(w,50)

eye(4) % identity matrix
I = eye(6)

help eye

chart.png

Move data around

A = [1 2; 3 5; 5 6]
size(A)
sz = size(A)
size(A,1)
v = [1 2 3 4]
length(v)
length(A)
length([1;2;3;4;5])
pwd
cd ..
ls
load('test.dat')
who
whos
save hello.mat v
save hello.txt v -ascii
clear
A = [1 2 ; 3 4; 5 6];
A(3,2)
A(:,2)
A([1 3], :)
A(:,2) = [10; 11; 12]
A = [A, [100; 101; 102]]; % append another column vector on the right
A(:) % put all elements of A into a single vector
A = [1 2; 3 4; 5 7];
B = [11 12; 13 14; 15 16];
C = [A B]
C = [A,B]
D = [A; B]

Computing on Data

A = [1 2; 3 4; 5 6];
B = [11 12; 13 14; 15 16];
C = [11 12; 13 14]

A.*B
A.^2
v = [1; 2; 3]
1 ./ v
1 ./ A
log(v)
exp(v)
abs(v)
abs([-1; 2; -3])
-v
v + ones(length(v),1)
v + 1

A'
(A')'

a = [1 15 2 0.5]
val = max(a)
[val, ind] = max(a)
max(A)
a < 3
A = magic(3)
[r,c] = find(A >= 7)

sum(a)
prod(a)
floor(a)
ceil(a)

max(A,[],1) % column wise max
max(A,[],2) % row wise max
max(A)
max(max(A))
max(A(:)) % find max of all the elements

A = magic(9)
sum(A,1) % column wise sum

sum(sum(A .* (eye(9)))) % sum the diagonal values
sum(sum(A .* flipud(eye(9)))) % sum the anti-diagonal values

pinv(A)                         % pseudo-inverse
temp = pinv(A)
temp * A

Plotting Data

t = [0:0.01:0.98];
y1 = sin(2*pi*4*t);
plot(t,y1);
y2 = cos(2*pi*4*t);
plot(t,y2);

plot(t,y1);
hold on;
plot(t,y2,'r');
xlabel('time');
ylabel('value');
legend('sin','cos');
title('my plot');
print -dpng 'myPlot.png'
close


figure(1); plot(t,y1);
figure(2); plot(t,y2);

subplot(1,2,1) % divides plot into a 1x2 grid, access first element
plot(t,y1);
subplot(1,2,2);
plot(t,y2);
axis([0.5 1 -1 1]) % change horizontal range to [0.5,1] and vertical to [-1,1]

A = magic(5)
imagesc(A)
imagesc(A), colorbar, colormap gray; % comma chains commands and still prints their output, unlike semicolon

Control Statements

v = zeros(10,1)
for i = 1:10,
    v(i) = 2^i;
end;

i = 1;
while i <= 5,
      v(i) = 100;
      i = i + 1;
end;


i = 1;
while true,
      v(i) = 999;
      i = i + 1;
      if i == 6,
        break;
      end;
end;

Vectorization

\(h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x\)

% unvectorized implementation
prediction = 0.0;
for j = 1:n+1,
    prediction = prediction + theta(j) * x(j);
end;

% vectorized implementation
prediction = theta' * x;

Logistic Regression (Week 3)

Classification

Applying linear regression to a classification problem is not a good idea. screenshot-23.png

You can use 0.5 as a threshold to make predictions based on the value of \(h_\theta(x)\). However, this is not a good idea: adding a single new sample point can change the fitted hypothesis, and hence the predictions, dramatically. Besides, \(h_\theta(x)\) may fall outside the range [0, 1], making the prediction rather confusing. Logistic regression, despite the word "regression", is used for classification, and its hypothesis is guaranteed to be in the range [0,1]. screenshot-24.png

Hypothesis Representation

Logistic Regression Model

We set \(h_\theta(x) = g(\theta^Tx)\), where \(g(z) = \frac{1}{1+e^{-z}}\). The function \(g\) is called the sigmoid function, or the logistic function. screenshot-25.png
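A one-line Octave sketch of the sigmoid:

g = @(z) 1 ./ (1 + exp(-z));   % works elementwise on vectors and matrices
g(0)     % ans = 0.50000
g(10)    % close to 1
g(-10)   % close to 0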

Interpretation of Hypothesis Output

You can think of \(h_\theta(x)\) as the estimated probability that \(y = 1\) on input \(x\). screenshot-26.png

Decision Boundary

It's a line that separates the regions where the hypothesis predicts 1 or 0.

Since \(g(z) \geq 0.5\) when \(z \geq 0\), we have \(h_\theta(x) \geq 0.5\) whenever \(\theta^{T} x \geq 0\).

Cost Function

How do we fit the parameters θ for logistic regression? We cannot simply reuse the squared error cost: with the sigmoid hypothesis it would be non-convex. Instead, we define the cost function as below. screenshot-27.png
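For reference, the cost for a single training example is

\[ \text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \]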

Simplified Cost Function and Gradient Descent

Remember that the logistic regression cost function has two cases. Since \(y\) can only be 0 or 1, we can write the cost function in a single expression: screenshot-28.png
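That is,

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right) \right] \]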

Remember that gradient descent is like: screenshot-29.png

where the partial derivatives look like: screenshot-30.png
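They have the same form as in linear regression (only \(h_\theta\) differs):

\[ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]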

Advanced Optimization

screenshot-31.png

Note that Octave indexing starts from 1 instead of 0. screenshot-32.png
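A sketch of the workflow from the slides, using the lecture's toy example \(J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2\):

function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;  % cost J(theta)
  gradient = zeros(2,1);                       % gradient of J
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)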

Multiclass Classification: One-vs-all

One-vs-all is also called one-vs-rest.

screenshot-33.png

The method is: screenshot-34.png
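In short: train a logistic regression classifier \(h_\theta^{(i)}(x)\) for each class \(i\); on a new input \(x\), predict the class \(i\) that maximizes \(h_\theta^{(i)}(x)\).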

Regularization (Week 3)

The problem of Overfitting

screenshot-35.png

screenshot-36.png

For Logistic regression: screenshot-37.png

How to solve the problem? screenshot-38.png

Cost Function

screenshot-39.png The idea behind regularization is: screenshot-40.png Note that we don't penalize \(\theta_0\).
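Concretely, for linear regression the regularized cost function is

\[ J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right] \]

(the regularization sum starts at \(j = 1\)).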

screenshot-41.png

What if λ is too large? screenshot-42.png

Regularized linear regression

screenshot-43.png

When using the normal equation: screenshot-44.png
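The regularized normal equation from the slide is

\[ \theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y \]

where the inner matrix is \((n+1) \times (n+1)\); as long as \(\lambda > 0\), the term in parentheses is invertible.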

Regularized Logistic Regression

Now update \(\theta_0\) separately. screenshot-45.png

Neural Networks: Representation (Week 4)

Non-linear Hypotheses

Using quadratic features results in a lot of features. screenshot-46.png

Neurons and the Brain

The "one learning algorithm" hypothesis: the brain may use a single learning algorithm for many different tasks.

Model Representation I

Neuron model: Logistic unit screenshot-47.png

A neural network is a group of these neurons. screenshot-48.png

screenshot-49.png

Model Representation II

Forward propagation: Vectorized implementation screenshot-50.png
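In symbols, for each layer \(l\) (with \(a^{(1)} = x\)):

\begin{gather*} z^{(l+1)} = \Theta^{(l)} a^{(l)} \\ a^{(l+1)} = g(z^{(l+1)}) \end{gather*}

adding the bias unit \(a_0^{(l+1)} = 1\) to each hidden layer before moving on.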

Bias unit: screenshot-51.png

The last layer is like logistic regression, but the neural network learns its own features. screenshot-52.png

Examples and Intuitions I

OR Function screenshot-53.png
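For example, with weights \(-10, 20, 20\), the unit \(h_\Theta(x) = g(-10 + 20x_1 + 20x_2)\) outputs approximately 0 only when \(x_1 = x_2 = 0\), so it computes \(x_1 \text{ OR } x_2\).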

Examples and Intuitions II

NOT Function screenshot-54.png

The input layer has only x_1 and x_2. The second layer computes the features (x_1 AND x_2) and ((NOT x_1) AND (NOT x_2)). The third layer takes the features computed in layer 2 and performs an OR operation, which altogether simulates an XNOR operation. screenshot-55.png

Handwritten digit classification screenshot-56.png

Multiclass Classification

Use an extension of the One-vs-all method. screenshot-57.png

Neural Networks: Learning (Week 5)

Cost Function

screenshot-58.png

This is the cost function screenshot-59.png
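Written out, for a network with \(L\) layers, \(s_l\) units in layer \(l\), and \(K\) output units:

\[ J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2 \]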

Backpropagation Algorithm

Forward Propagation, computing all the activations: screenshot-60.png

Backpropagation algorithm, computing the gradient: screenshot-61.png

screenshot-62.png

The name "back" comes from the order we compute the "error" Backpropagation algorithm screenshot-63.png

Backpropagation Intuition

screenshot-64.png screenshot-65.png

Implementation Note: Unrolling Parameters

In neural networks our parameters are matrices, and we need to unroll them into vectors. screenshot-66.png
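The Octave sketch from the slide, assuming the lecture's example layer sizes \(s_1 = 10, s_2 = 10, s_3 = 1\) (so Theta1 is 10x11, Theta2 is 10x11, Theta3 is 1x11):

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];  % unroll into one long vector
Theta1 = reshape(thetaVec(1:110), 10, 11);     % recover the matrices
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);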

Gradient Checking

Use numerical estimation of gradients to check the backpropagation implementation.
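A sketch of the two-sided difference from the lecture (theta is the unrolled parameter vector with n components; J is assumed to be a function handle computing the cost):

EPSILON = 1e-4;
for i = 1:n,
    thetaPlus = theta;  thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta; thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;
% check that gradApprox is approximately equal to the backprop gradient,
% then turn gradient checking off before training (it is slow)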

Random Initialization

We need to provide an initial value of \(\theta\) for gradient descent and the advanced optimization methods.

Zero initialization will result in identical values for all hidden units (the weights stay symmetric).

To solve this problem, we use random initialization to break the symmetry. screenshot-69.png
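The slide's recipe in Octave, where INIT_EPSILON is a small constant (e.g. 0.01):

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11)  * (2 * INIT_EPSILON) - INIT_EPSILON;
% each element is now uniform in [-INIT_EPSILON, INIT_EPSILON]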

Put it together

screenshot-70.png

How to train a neural network screenshot-71.png screenshot-72.png

Advice for Applying Machine Learning (Week 6)

Deciding What to Try Next

  1. Get more training examples.
  2. Try smaller sets of features.
  3. Try getting additional features.
  4. Try adding polynomial features.
  5. Try decreasing λ.
  6. Try increasing λ.

Evaluating a Hypothesis

A hypothesis that fits the training set well can still be overfitting.

We can evaluate our hypothesis by splitting our data into a training set and a test set.

Model Selection and Train/Validation/Test Sets

screenshot-73.png

screenshot-74.png To address this problem, we divide our data set into three parts: training set, cross validation set, and test set. screenshot-76.png screenshot-75.png

Use the cross validation error to select the model (e.g., the polynomial degree), and use the test set error to estimate the generalization error of the selected model.

Diagnosing Bias vs. Variance

Bias (underfit) vs. variance (overfit). screenshot-77.png
